Usage

This suite of tools is compatible with CRISPResso v2.0.31 through v2.0.37. Its analysis modes are listed under the -m parameter in the Parameter list below.

Note that these tools are incompatible with prime editing.


CRISPResso2_downstream processing


Setup

Download the CRISPResso2_downstream source files from the CRISPResso2 GitHub repository.

The necessary package dependencies are best installed in a conda environment.

Use one of the crispresso_downstream_env.yml files to generate a crispresso_downstream_env environment:

For local use on Mac OSX systems:

conda env create -f crispresso_downstream_env_osx.yml

For use on Unix/Linux systems:

conda env create -f crispresso_downstream_env_non_osx.yml

The environment contains the following dependencies:

python=2.7.15  
R>=3.5.1  

r-optparse  
r-tidyselect  
r-tidyverse  
r-RColorBrewer  
r-grid  
r-gtable  
r-extrafont  
r-scales  
r-effsize  
r-cowplot

Once the conda environment is generated, activate it with the following command:

conda activate crispresso_downstream_env

To deactivate the conda environment after running the analyses, input the following:

conda deactivate

All commands must be input from within the CRISPResso2_downstream source directory.


Sample data

Sample off-target CRISPRessoPooled outputs and analysis inputs from Zeng et al. [1] (https://www.nature.com/articles/s41591-020-0790-y?draft=collection) are provided in the Sample_data.zip file. The input sequences were generated using IDT’s rhAMPSeq panels. All sample inputs and outputs shown in this documentation are drawn from the same dataset and analysis.

Sample_data.zip contains the ref_seq_csv (1620_ref_seqs_tb.csv) file, the ot_sample_csv (2019_1620_BE_rhAMPSeq_samples.csv) file, and eight CRISPRessoPooled output directories.

[1] Zeng, J., Wu, Y., Ren, C. et al. Therapeutic base editing of human hematopoietic stem cells. Nat Med 26, 535–541 (2020). https://doi.org/10.1038/s41591-020-0790-y


Parameter list

Also available by running Rscript crispresso_downstream_v2.0.37.R -h

-c, --crisp_out_dir: path to folder storing CRISPRessoPooled output directories

-d, --dataID: data identifier common to all names of CRISPRessoPooled input amplicons (the name following “CRISPResso_on_”), regular expressions accepted

-m, --mode: analysis modes [default collapse_only]:

  • collapse_only = generate collapsed allele table only
  • collapse_BE = generate collapsed allele table & base editing summary (requires BE parameters)
  • collapse_OT = generate collapsed allele table & off-target summary (requires OT parameters)
  • collapse_BE_OT = generate collapsed allele table, base editing summary, and off-target summary (requires BE and OT parameters)
  • BE_OT = generates base editing and off-target summary (requires BE parameters, OT parameters, and collapsed allele tables)
  • BE_only = generates base editing summary (requires BE parameters and collapsed allele tables)
  • OT_only = generates off-target summary (requires OT parameters and collapsed allele tables)

-p, --percent_freq_cutoff: minimum frequency at which an allele must appear in any sample/amplicon to be included in the collapsed output table [default 0]

--CRISPRessoBatch: for any collapse mode: use CRISPRessoBatch files as inputs [default FALSE]

--CRISPRessoPooled: for any collapse mode: use CRISPRessoPooled files as inputs [default FALSE]

-n, --noSub: do not include substitutions as edits in output table [default FALSE]

-f, --conversion_nuc_from: for any BE mode: the nucleotide targeted by the base editor [default C]

-t, --conversion_nuc_to: for any BE mode: the nucleotide(s) produced by the base editor. If multiple nucleotides, enter all letters without separators (ex. ATG) [default T]

-s, --ot_sample_csv: for any OT mode: path to sample csv file (see “Input files” below for details)

-r, --ref_seq_csv: for any OT mode: path to reference and guide sequences csv file (see “Input files” below for details)

-b, --be_summary_exists: for OT_only mode: add this flag if BE summaries already exist (and set -f and -t if they differ from the default values) [default FALSE]

-e, --scale_size_by_editing_freq: for OT_only mode: add this flag to size the points in the OT % editing scatterplot according to sample read coverage [default FALSE]

-l, --low_coverage: for OT_only mode with the -e flag: the upper read count cutoff for “low-coverage” amplicons/samples [default 1000]

-u, --high_coverage: for OT_only mode with the -e flag: the lower read count cutoff for “high-coverage” amplicons/samples [default 10000]
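
The mode names follow a pattern: a collapse_ prefix builds the collapsed allele table, while BE and OT add the base editing and off-target summaries. As a rough aid for choosing a mode, the Python sketch below restates the -m list as a lookup. It is an illustration only and not part of the R tool; the function name mode_requirements and the dictionary keys are hypothetical.

```python
# Illustrative restatement of the -m mode list; not part of the R tool.
MODE_STEPS = {
    "collapse_only": ("collapse",),
    "collapse_BE": ("collapse", "BE"),
    "collapse_OT": ("collapse", "OT"),
    "collapse_BE_OT": ("collapse", "BE", "OT"),
    "BE_OT": ("BE", "OT"),
    "BE_only": ("BE",),
    "OT_only": ("OT",),
}


def mode_requirements(mode):
    """Summarize what a mode needs, following the mode list in this README."""
    steps = MODE_STEPS[mode]  # raises KeyError for an unknown mode
    return {
        # Modes without a collapse step read existing collapsed allele tables.
        "needs_existing_collapsed_tables": "collapse" not in steps,
        "needs_BE_parameters": "BE" in steps,  # -f, -t
        "needs_OT_parameters": "OT" in steps,  # -s, -r
    }


for mode in MODE_STEPS:
    print(mode, mode_requirements(mode))
```

Note that the BE parameters have defaults (C and T), so "needs" here only means the values are used by that mode.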

 


Guidelines for CRISPResso inputs

Follow the CRISPResso2 documentation to run either CRISPRessoBatch or CRISPRessoPooled.

Using with CRISPRessoPooled

  1. When running CRISPRessoPooled and multiple guides are used for the same amplicon, the full names of all guides must be included and separated by underscores in pooled.txt (ex. when guides/targets 1620_OT_10 and 1620_OT_14 are on the same amplicon, name the run “1620_OT_10_1620_OT_14”).

  2. For easy identification and sorting, it is best to name off-targets guide_OT_#### (ex. 1620_OT_12) beginning with the on-target as “0” (ex. 1620_OT_000).

  3. Set --min_reads_to_use_region lower if low read coverage is expected for any amplicon, as amplicons with fewer aligned reads will not be included in the analysis.


Input files

All input files (ref_seq_csv and ot_sample_csv) should be in the same directory as the CRISPRessoBatch/CRISPRessoPooled run output files.

Off-target sample csv (ot_sample_csv)

File linking CRISPRessoPooled output files to biological samples/conditions necessary for off-target analysis (not necessary for collapse or BE modes). As the off-target visualization tool takes up to 30 samples, this table may have up to 30 rows (not including headers). Input headers exactly as they are shown in the sample table and column descriptions below.

donor   | condition | CRISPResso_dir_name                | sample_name  | R_color        | R_fill         | R_shape
--------|-----------|------------------------------------|--------------|----------------|----------------|--------
Donor_6 | mock      | all_merged_EP1116-Mock_S1_mix      | Donor_6 mock | firebrick1     | firebrick1     | 1
Donor_6 | edited    | all_merged_BE1116-1620-1x_S2_mix   | Donor_6 1EP  | firebrick1     | firebrick1     | 16
Donor_6 | edited    | all_merged_BE1116-1620-2x_S3_mix   | Donor_6 2EP  | firebrick1     | firebrick1     | 16
Donor_7 | mock      | all_merged_BE1215-1-Mock_S4_mix    | Donor_7 Mock | cornflowerblue | cornflowerblue | 1
Donor_7 | edited    | all_merged_BE1215-2-1620-1x_S5_mix | Donor_7 1EP  | cornflowerblue | cornflowerblue | 16
Donor_7 | edited    | all_merged_BE1215-3-1620-2x_S6_mix | Donor_7 2EP  | cornflowerblue | cornflowerblue | 16
Donor_8 | mock      | merged_BE0108-1-Mock_S7_mix        | Donor_8 Mock | darkgoldenrod1 | darkgoldenrod1 | 1
Donor_8 | edited    | all_merged_BE0108-9-1620-2x_S8_mix | Donor_8 2EP  | darkgoldenrod1 | darkgoldenrod1 | 16

Necessary columns:

  1. donor = cell donor ID + technical replicate (if applicable)

  2. condition = either “mock” or “edited”

  3. CRISPResso_dir_name = names of CRISPRessoPooled output directories following CRISPRessoPooled_on_

  4. sample_name = sample names to be displayed in off-target output figure (text separated by spaces will be displayed on separate lines in the final figure)

Optional aesthetics columns (see options below):

  1. R_color = R grDevices::colors color names or hex codes indicating the colors differentiating CRISPRessoPooled runs/samples in the off-target editing dotplot. If the column does not exist, default colors are generated.

  2. R_fill = R grDevices::colors color names or hex codes, same as R_color options. By default, R_fill options are the same as R_color options.

  3. R_shape = R ggplot2 shape options. By default, control/mock samples are unfilled circles (1), and edited samples are filled circles (16).
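
Before running an OT mode, it can help to check the sample csv against the rules described here. The Python sketch below is an illustration only, not part of the R tool: it checks the four necessary headers, the 30-row limit, and the condition values. The function name check_sample_csv is hypothetical, and the second example row uses a deliberately invalid condition ("treated") to show a reported problem.

```python
import csv
import io

REQUIRED_COLUMNS = ("donor", "condition", "CRISPResso_dir_name", "sample_name")
MAX_SAMPLES = 30  # stated limit of the off-target visualization


def check_sample_csv(handle):
    """Return a list of problems found in an ot_sample_csv-style table."""
    reader = csv.DictReader(handle)
    rows = list(reader)
    headers = reader.fieldnames or []
    problems = [f"missing column: {name}" for name in REQUIRED_COLUMNS if name not in headers]
    if len(rows) > MAX_SAMPLES:
        problems.append(f"{len(rows)} sample rows found; at most {MAX_SAMPLES} are supported")
    if "condition" in headers:
        for number, row in enumerate(rows, start=1):
            if row["condition"] not in ("mock", "edited"):
                problems.append(
                    f"row {number}: condition is {row['condition']!r}, expected 'mock' or 'edited'"
                )
    return problems


example = io.StringIO(
    "donor,condition,CRISPResso_dir_name,sample_name\n"
    "Donor_6,mock,all_merged_EP1116-Mock_S1_mix,Donor_6 mock\n"
    "Donor_6,treated,all_merged_BE1116-1620-1x_S2_mix,Donor_6 1EP\n"  # invalid on purpose
)
for problem in check_sample_csv(example):
    print(problem)
```

The optional aesthetics columns are not checked, since the tool generates defaults when they are absent.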

 

R color and shape values

[Figure: R_color options]

[Figure: R_shape options]

Off-target reference sequences csv (ref_seq_csv)

File containing the reference amplicon sequence, guide sequence, and PAM sequence for each off-target. The table is very similar to the CRISPRessoPooled pooled.txt file and is necessary for off-target analysis (not necessary for collapse or BE modes). Unlike the CRISPRessoPooled pooled.txt file, which allows multiple guides per reference amplicon, the ref_seq_csv requires each guide to be in its own row.
Input headers exactly as they are shown in the sample table and column descriptions below.

ot_id      | amplicon_sequence     | aligned_guide_seq    | guide_sequence       | pam
-----------|-----------------------|----------------------|----------------------|----
1620_OT_1  | CAGGTAATAACATAGGCCAG… | TTTATCACAGGCTCCAGGAA | TTTATCACAGGCTCCAGGAA | GGG
1620_OT_10 | GCCCAACCAAATCAATATGA… | TTTGTCACAGTCTTCAGGAA | TTTGTCACAGTCTTCAGGAA | AGG
1620_OT_11 | AGATTTCAAGACAAA…      | TTTATCTAATGCTCCAGGAA | TTTATCTAATGCTCCAGGAA | AAG
1620_OT_12 | ACACTCATACCTCCCGTTT…  | TGTATGACAGGCTCCGGGAA | TGTATGACAGGCTCCGGGAA | AAG
1620_OT_13 | TCCACAGTCCTTGTACT…    | TTGATCACAGGCATCAGGAA | TTGATCACAGGCATCAGGAA | CAG
1620_OT_14 | ACCAGCAGCTGAGAGAAA…   | TTGATCTCAGGCACCAGGAA | TTGATCTCAGGCACCAGGAA | CGG

Input columns:
Each off-target “guide” should be in a separate row, even when several guides are on the same amplicon. With the exception of ot_id and guide_sequence, all sequences must consist of uppercase A, C, G, and T only.

  1. ot_id = off-target names (should match the names in the CRISPRessoPooled pooled.txt file exactly)

  2. amplicon_sequence = from the CRISPRessoPooled pooled.txt file

  3. aligned_guide_seq = 20-bp guide sequence in the CRISPRessoPooled pooled.txt file (no bulge placeholders “-” or lowercase letters)

  4. guide_sequence = guide sequence to be displayed in off-target output figures (can contain bulge “-” placeholders and lowercase letters). The sequences in this column must be unique.

  5. pam = PAM sequence to be displayed in off-target output figures (does not need to be a canonical PAM)
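
The column rules above can be checked before running an OT mode. The Python sketch below is an illustration only, not part of the R tool. It assumes the sequence rule means the letters A, C, G, and T, and it compares guide_sequence values as exact strings. The function name check_ref_rows is hypothetical, the amplicon sequences are made-up placeholders, and the second row is deliberately invalid (a bulge placeholder in aligned_guide_seq and a repeated guide_sequence).

```python
import re

UPPER_DNA = re.compile(r"[ACGT]+")  # assumption: uppercase A, C, G, T only
SEQUENCE_COLUMNS = ("amplicon_sequence", "aligned_guide_seq", "pam")


def check_ref_rows(rows):
    """Return a list of problems found in ref_seq_csv-style rows (dicts)."""
    problems = []
    first_seen = {}  # guide_sequence -> row number where it first appeared
    for number, row in enumerate(rows, start=1):
        for column in SEQUENCE_COLUMNS:
            if not UPPER_DNA.fullmatch(row[column]):
                problems.append(f"row {number}: {column} must contain only uppercase A, C, G, T")
        guide = row["guide_sequence"]
        if guide in first_seen:
            problems.append(f"row {number}: guide_sequence duplicates row {first_seen[guide]}")
        else:
            first_seen[guide] = number
    return problems


rows = [
    {"ot_id": "1620_OT_1", "amplicon_sequence": "ACGTACGTACGT",  # placeholder amplicon
     "aligned_guide_seq": "TTTATCACAGGCTCCAGGAA",
     "guide_sequence": "TTTATCACAGGCTCCAGGAA", "pam": "GGG"},
    {"ot_id": "1620_OT_10", "amplicon_sequence": "ACGTACGTACGT",  # placeholder amplicon
     "aligned_guide_seq": "TTTGTCACAG-CTTCAGGAA",  # invalid on purpose: bulge placeholder
     "guide_sequence": "TTTATCACAGGCTCCAGGAA", "pam": "AGG"},  # invalid on purpose: repeated
]
for problem in check_ref_rows(rows):
    print(problem)
```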


Summarize and collapse editing results (Collapse mode)

The Collapse mode uses the Alleles_frequency_table_around_sgRNA_[guideseq].txt output by CRISPResso to summarize each indel in more readable terms (deletion lengths and locations, insertion sequences and locations, substitution conversions and locations) with the first nucleotide of the guide as base pair index 1. Only modifications within the set quantification window are counted.

Example Alleles_frequency_table_around_sgRNA_[guideseq].txt

Unlike in the CRISPResso allele tables, each row is a unique indel, not a unique read. The Collapse mode collapses all reads with the same indel. Each indel must reach the percent_freq_cutoff set by the user in at least one sample; otherwise it is filtered out of the table. After filtering, all allele frequencies are renormalized to 100%.
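
The filtering and renormalization step can be pictured with a small Python sketch. This is an illustration under stated assumptions, not the tool's R code: it assumes renormalization is done separately for each sample and that the unedited row is treated like any other row. The function name filter_and_renormalize and the frequencies are made up for the example.

```python
def filter_and_renormalize(freqs, cutoff):
    """freqs maps indel -> {sample: percent}. Keep indels that reach the
    cutoff in at least one sample, then rescale each sample to sum to 100."""
    kept = {
        indel: per_sample
        for indel, per_sample in freqs.items()
        if any(value >= cutoff for value in per_sample.values())
    }
    samples = {sample for per_sample in kept.values() for sample in per_sample}
    totals = {
        sample: sum(per_sample.get(sample, 0.0) for per_sample in kept.values())
        for sample in samples
    }
    return {
        indel: {
            sample: (100.0 * value / totals[sample]) if totals[sample] else 0.0
            for sample, value in per_sample.items()
        }
        for indel, per_sample in kept.items()
    }


# Made-up frequencies (percent) for two samples.
table = {
    "Unedited": {"s1": 90.0, "s2": 80.0},
    "C6T": {"s1": 9.5, "s2": 19.0},
    "-1 (18)": {"s1": 0.5, "s2": 1.0},
}
result = filter_and_renormalize(table, cutoff=2.0)
for indel, per_sample in result.items():
    print(indel, {s: round(v, 2) for s, v in per_sample.items()})
```

With a cutoff of 2, the "-1 (18)" row never reaches the cutoff in either sample and is dropped, and the remaining rows are rescaled so each sample again sums to 100%. With the default cutoff of 0, every row is kept.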
 

Example

collapse_only mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/local_Jing_BE/1620/20200711_1620_ONESeq_rhAMPSeq_triplicates -d OT -m collapse_only -p 0 --CRISPRessoPooled

 

Output

[CRISPResso_RUN_NAME]_collapsed_[percent_freq_cutoff].csv
Table of all collapsed indels across all samples in the CRISPRessoBatch or CRISPRessoPooled run generated from Alleles_frequency_table_around_sgRNA_[guideseq].txt (CRISPResso output).

 

Columns:

  • Aligned_Sequence = example Aligned_Sequence around the quantification window center for one sample (kept as an artifact of CRISPResso output)

  • Reference_Sequence = example Reference_Sequence around the quantification window center for one sample (kept as an artifact of CRISPResso output)

  • Unedited = logical indicating whether the row contains the frequency of unedited alleles

  • n_deleted = number of base pairs deleted in deletions overlapping the quantification window

  • n_inserted = number of base pairs inserted in insertions overlapping the quantification window

  • n_mutated = number of base pairs modified in substitutions overlapping the quantification window

  • indel = detailed summary of the indel within the quantification window set in CRISPResso (sorted alphanumerically, notation explained in section below)

  • *[run]_reads__[GUIDE_SEQ]* = number of reads of the indel (row) within the sample/CRISPResso run (column)

  • *[run]__[GUIDE_SEQ]* = percent frequency of the indel (row) within the sample/CRISPResso run (column)

     

Indel notation
In the indel column, base pair indexes are centered at the beginning of the aligned spacer sequence, with the first base pair of the spacer distal to the PAM counted as 1 (see figure below).

The summarized mutation notation in the indel column, which is consistent in all output files generated by this tool, is as follows:

  • Substitution: (ex. C6T, G10A) the first base pair is the reference nucleotide, the middle number is the base pair index, and the last nucleotide is the aligned nucleotide; C6T means that the C at the 6th bp in the spacer was converted to T

  • Deletion: (ex. -1 (18), -5 (19-23)) the minus indicates a deletion, the number immediately following the minus sign displays the deletion length, and the numbers in the parentheses indicate the bp indexes of the deletion; -1 (18) is a minus one deletion at the 18th bp of the spacer sequence

  • Insertion: (ex. +1 (T 17), +9 (GTTCCAGAG 11-19)) the plus indicates an insertion, the number immediately following the plus sign displays the insertion length, the nucleotides inside the parentheses make up the inserted sequence, and the bp index range following the sequence shows the location of the insertion; +1 (T 17) is a plus one insertion at the 17th bp of the spacer sequence

While a single allele may contain multiple substitutions, only one indel (insertion/deletion) is identified within the quantification window, as an insertion and a deletion cannot take place at the same site. Each combination of substitutions and a single indel is counted as an individual allele (and an individual row in the collapsed allele tables).
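
For downstream scripting, the notation can be read back into structured fields. The Python sketch below is an illustration only, not part of the R tool. It parses one edit label at a time, using the examples from the list above; how several edits are joined within one indel cell is not specified here, so the sketch does not attempt to split them. It also assumes positive bp indexes. The function name parse_edit and the returned keys are hypothetical.

```python
import re

# One pattern per edit type, following the notation described in this section.
SUBSTITUTION = re.compile(r"([A-Z])(\d+)([A-Z])")                   # C6T
DELETION = re.compile(r"-(\d+) \((\d+)(?:-(\d+))?\)")               # -5 (19-23)
INSERTION = re.compile(r"\+(\d+) \(([A-Z]+) (\d+)(?:-(\d+))?\)")    # +1 (T 17)


def parse_edit(label):
    """Parse a single edit label into a dict; raise ValueError if unrecognized."""
    match = SUBSTITUTION.fullmatch(label)
    if match:
        return {"type": "substitution", "ref": match.group(1),
                "pos": int(match.group(2)), "alt": match.group(3)}
    match = DELETION.fullmatch(label)
    if match:
        start = int(match.group(2))
        end = int(match.group(3)) if match.group(3) else start
        return {"type": "deletion", "length": int(match.group(1)),
                "start": start, "end": end}
    match = INSERTION.fullmatch(label)
    if match:
        start = int(match.group(3))
        end = int(match.group(4)) if match.group(4) else start
        return {"type": "insertion", "length": int(match.group(1)),
                "sequence": match.group(2), "start": start, "end": end}
    raise ValueError(f"unrecognized edit label: {label!r}")


for label in ["C6T", "-1 (18)", "-5 (19-23)", "+1 (T 17)", "+9 (GTTCCAGAG 11-19)"]:
    print(label, "->", parse_edit(label))
```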

   

Filter base editing results (BE mode)

The BE mode filters the collapsed allele tables by indel so that only two kinds of edits are counted: base editing (target and resulting nucleotides set by the -f and -t parameters) at bps 3-10 of the guide, and indels overlapping bps 17-18, the Cas9 cut site. All other edits are counted as “Unedited” in the output table.
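
The filter rule can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's R code: it reads the rule as "substitutions from the -f nucleotide to any -t nucleotide at bp indexes 3 through 10, plus insertions or deletions whose index range overlaps 17 through 18". The function name classify_event, the event dictionaries, and the returned labels other than "Unedited" are hypothetical.

```python
def classify_event(event, nuc_from="C", nuc_to="T"):
    """Classify one edit as 'base_edit', 'cut_site_indel', or 'Unedited'.

    event is a dict: substitutions carry 'ref', 'pos', 'alt'; insertions and
    deletions carry 'start' and 'end' (bp indexes, first guide bp = 1).
    nuc_to may hold several letters without separators, e.g. "ATG".
    """
    if event["type"] == "substitution":
        is_target = (
            event["ref"] == nuc_from
            and event["alt"] in nuc_to
            and 3 <= event["pos"] <= 10
        )
        return "base_edit" if is_target else "Unedited"
    # Insertion or deletion: kept when its range overlaps bp indexes 17-18.
    overlaps_cut_site = event["start"] <= 18 and event["end"] >= 17
    return "cut_site_indel" if overlaps_cut_site else "Unedited"


events = [
    {"type": "substitution", "ref": "C", "pos": 6, "alt": "T"},   # C6T
    {"type": "substitution", "ref": "G", "pos": 10, "alt": "A"},  # G10A
    {"type": "deletion", "start": 18, "end": 18},                 # -1 (18)
    {"type": "deletion", "start": 19, "end": 23},                 # -5 (19-23)
    {"type": "insertion", "start": 17, "end": 17},                # +1 (T 17)
]
for event in events:
    print(event, "->", classify_event(event, nuc_from="C", nuc_to="T"))
```

Passing nuc_to="ATG" mirrors the -t ATG usage, where any of the listed nucleotides counts as a base edit.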

 

Example

collapse_BE mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/local_Jing_BE/1620/20200619_1620_ONESeq_rhAMPSeq_triplicates -p 0 -d 1620 -m collapse_BE -f C -t T --CRISPRessoPooled

BE_only mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/local_Jing_BE/1620/20200619_CRISPRessoBatch_1620_OT_111 -p 0 -d Donor -m BE_only -f C -t ATG

 

Output

[CRISPResso_RUN_NAME]_BE_summary_[conversion].csv
Filtered allele table. The columns are the same as those of the collapsed allele tables (listed above).

   

Visualize potential off-target editing (OT mode)

OT mode is designed for visualizing off-target editing, but it can be repurposed for any pool of amplicons run in CRISPRessoPooled.

The OT mode performs a simple t-test comparing percent editing in edited v. control samples and generates composite figures summarizing off-target editing.

 

Example

collapse_BE_OT mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/local_Jing_BE/1620/20200619_1620_ONESeq_rhAMPSeq_triplicates -p 0 -d 1620 -m collapse_BE_OT -f C -t ATG -s 20200619_1620_rhAMPSeq_samples.csv -r 1620_ONESeq_ref_seqs.csv --CRISPRessoPooled

collapse_OT mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/IND_off_target/2020_CRISPResso2/20200710_DE_1450_rhAMPSeq -d OT -m collapse_OT -p 0 -n -s 202006_DE_1450_rhAMPSeq_samples.csv -r 202006_1450_0000_ref_seqs.csv --CRISPRessoPooled

OT_only mode (BE summary exists)

Rscript crispresso_downstream_v2.0.37.R -c /Users/local_Jing_BE/1620/20200326_1620_ONESeq_rhAMPSeq_CRISPRessoPooled -p 0 -d 1620 -m OT_only -f C -t T -s ../BE_rhAMPSeq_samples.csv -r ../1620_ONESeq_ref_seqs.csv --CRISPRessoPooled -b

OT_only mode

Rscript crispresso_downstream_v2.0.37.R -c /Users/IND_off_target/2020_CRISPResso2/20200710_DE_1450_rhAMPSeq -d OT -m OT_only -p 0 -n -s 202006_DE_1450_rhAMPSeq_samples.csv -r 202006_1450_0000_ref_seqs.csv --CRISPRessoPooled

 

Output

YYYYMMDD_CRISPResso_OT_editing_summary.csv
A table displaying the total editing frequency per off-target, per sample.

 

Columns:

  • off_target = off-target names

  • guide_sequence = guide/off-target sequence (no PAM) including mismatches and bulges

  • pam = PAM sequence

  • sample = sample name

  • condition = control or edited

  • aligned_guide_seq = guide/off-target sequence without mismatches/bulges (the sequence used for alignment)

  • editing_freq = total editing frequency per off-target, per sample

  • reads = number of edited reads

  • group = editing condition + sample name (same as in off-target composite figure below)

     

YYYYMMDD_CRISPResso_OTs_ttest.csv
This table displays summary statistics as well as the results of simple statistical tests comparing edited v. mock samples by off-target. Editing frequency (%) is pooled by off-target for all mock samples and all edited samples, and an independent, one-tailed t-test (95% confidence interval, Welch df) is performed to determine whether editing frequency is higher in edited samples compared to mock samples for each off-target. If fewer than 3 samples are included for either the edited or mock group, no statistical test is performed for the off-target. Likewise, if the variance is equal, no test is performed. In both cases, the p-value is NA in the output table.

In conjunction with the t-test, Cohen’s d (effect size) is also calculated using a 95% confidence interval (not pooled, Hedges correction applied).
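
To make the test concrete, the Python sketch below computes the Welch t statistic and the Welch degrees of freedom for one off-target, and returns None when either group has fewer than 3 samples. It is an illustration only, not the tool's R code: it does not compute the p-value or Cohen's d, the editing frequencies are made up, and the check for a zero denominator is a guard added here to avoid division by zero rather than a rule taken from the tool. The function name welch_t is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(edited, mock, min_samples=3):
    """Return (t, df) for edited v. mock editing frequencies, or None if untestable."""
    if len(edited) < min_samples or len(mock) < min_samples:
        return None  # too few samples in one group: no test
    # Squared standard errors of the two group means (sample variance / n).
    se2_edited = variance(edited) / len(edited)
    se2_mock = variance(mock) / len(mock)
    if se2_edited + se2_mock == 0:
        return None  # guard added in this sketch: both groups are constant
    t = (mean(edited) - mean(mock)) / sqrt(se2_edited + se2_mock)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_edited + se2_mock) ** 2 / (
        se2_edited ** 2 / (len(edited) - 1) + se2_mock ** 2 / (len(mock) - 1)
    )
    return t, df


# Made-up editing frequencies (%) for one off-target.
print(welch_t([2.0, 3.0, 4.0], [1.0, 1.5, 2.0]))
print(welch_t([2.0, 3.0], [1.0, 1.5, 2.0]))  # None: fewer than 3 edited samples
```

A positive t indicates higher mean editing in the edited group, which is the direction the one-tailed test looks for.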

 

Columns:

  • off_target = off-target names

  • edited/control_median = median value of edited/control editing frequencies (%) by off-target

  • edited/control_mean = mean value of edited/control editing frequencies (%) by off-target

  • edited/control_sd = standard deviation of edited/control editing frequencies (%) by off-target

  • ttest_p_value = independent, one-tailed t-test (95% confidence interval, Welch df) p-value comparing edited v. mock samples

  • eff_size = Cohen’s d (effect size) of difference between edited v. mock editing frequencies

  • significant = logical indicating whether the editing was significant for the off-target (for easy filtering)

     

off_target_Rplot.png/pdf
Example figure below. Figures are saved as individual .png files and pages in a single .pdf file.

The heatmap on the left displays the read coverage per amplicon, per sample. The off-target names and sequences in the center are followed by an * when the editing frequency in edited samples is significantly higher than that in mock/control samples. The editing frequency is displayed in the dotplot on the right.
 

Off-target composite figure details
[Figure panels: full plot; read coverage heatmap (left); editing % dotplot (right)]

 

off_target_summary_Rplot.png/pdf
Example figure below. Figures are saved as individual .png files and pages in a single .pdf file.

The heatmap on the left is the same as that of the (non-summary) figures above. The editing frequency summary statistics are displayed in the dotplot on the right; all mock samples are pooled, and all edited samples are pooled.

Summary off-target composite figure details
[Figure panels: full plot; read coverage heatmap (left); summary editing % dotplot (right)]